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Abstract 


The article derives some novel independence measures and contrast functions for Blind 
Source Separation (BSS) application. For the k th order differentiable multivariate func¬ 
tions with equal hyper-volumes (region bounded by hyper-surfaces) and with a constraint 
of bounded support for k > 1, it proves that equality of any k th order derivatives implies 
equality of the functions. The difference between product of marginal Probability Density 
Functions (PDFs) and joint PDF of a random vector is defined as Function Difference 
(FD) of a random vector. Assuming the PDFs are k th order differentiable, the results on 
generalized functions are applied to the independence condition. This brings new sets of 
independence measures and BSS contrasts based on the ZA-Norm, p> 1 of - FD, gradient of 
FD (GFD) and Hessian of FD (HFD). Instead of a conventional two stage indirect estima¬ 
tion method for joint PDF based BSS contrast estimation, a single stage direct estimation 
of the contrasts is desired. The article targets both the efficient estimation of the proposed 
contrasts and extension of the potential theory for an information field. The potential 
theory has a concept of reference potential and it is used to derive closed form expression 
for the relative analysis of potential held. Analogous to it, there are introduced concepts of 
Reference Information Potential (RIP) and Cross Reference Information Potential (CRIP) 
based on the potential due to kernel functions placed at selected sample points as basis in 
kernel methods. The quantities are used to derive closed form expressions for information 
held analysis using least squares. The expressions are derived through multiplicative kernel 
basis in two ways: (a) basis placed at the selected paired sample points (b) basis placed 
at the selected paired or un-paired sample points. The expressions are used to estimate 
L 2 -Norm of FD and L 2 -Norm of GFD based contrasts. Often, the performance of kernel 
based estimation methods is affected by the choice of a suitable bandwidth parameter. Usu¬ 
ally, the choice of a bandwidth parameter is a compromise between accuracy of estimation 
and computation. The article uses data dependent Extended Rule-of-Thumb (ExROT) 
for bandwidth selection that balances both accuracy and computation. The higher order 
cumulants based ExROT helps achieving parameter free estimation method. Finally, the 
contrasts are verified for source separation by obtaining their optimization landscapes for 
two sources on varying distributions. 


Keywords: Blind Source Separation (BSS), Independent Component Analysis (ICA), 
Independence Measure, Contrast function, Information Potential (IP), Reference Informa¬ 
tion Potential (RIP), Least Squares, Kernel Methods, Gradient of the Function Difference 
(GFD) 
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1. Introduction 


Contrast functions or simply contrast^] are the optimization functions to assure blind sep¬ 
aration of unobserved sources from the available observation mixtures, when maximized. 
The independence definition, its various interpretations and their approximations are used 
to derive contrasts. 

The initial phase of research on BSS contrasts focused on the Shanon entropy and 
Kullback-Leibler divergence (KLD) based information theoretic independence interpreta¬ 


tions and their approximations through higher order statistics (Cover and Thomas, 1991 


Papoulis, 1991). The other significant group of contrasts came from non-Gaussianity in- 

More 


2001 ) 


terpretations of independence and their approximations (Hyvarinen et al. 
details on these widely used, conventional contrast functions can be found in (Cardoso 


1999 

Pham 

2001, 

Hyvarinen 

1997 


The research towards new contrasts for BSS has the following motivations. 

1. More accurate BSS solution seems an everlasting hunger. So, just out of mathematical 
vigor to search for a more accurate solutions, new contrasts are always of interest. 


2. The Shanon entropy based contrasts are found to have spurious local optima (Boscolo 


et ah, 2004 Vrins and Verleysen, 2005 Pham and Vrins, 2005). Therefore, the contrast 


functions without the existence of spurious local optima are desired. 

The large scale in BSS requires balancing accuracy with computation. This has mo¬ 


tivated direct and fast estimation methods to derive contrasts (Pham 
Suzuki and Sugiyama 2011|). 


2003. 2004 


4. 


Some BSS contrasts with their estimation methods are biased towards a parametric 
family, say, subgaussian or supergaussian. To achieve unbiased estimation of sources, 
the focus has shifted to BSS using kernel based non-parametric estimation of various 


independence measures, as in Nonparametric ICA (NPICA) (Boscolo et ah 
kernel ICA (kICA) ( |Bach and Jordan 2003). 


2004) and 


5. 


The use of ‘prior’ information with the independence assumption may find better 
estimations of the actual sources. Therefore, the contrast functions incorporating more 
generalized priors without violating the blind assumptions, other than the application 
specific priors used in Bayesian approach for BSS and semi-BSS problems, are of 
interest. The bounded support assumption is one of such assumptions, used by many 


geometry based ICA and BSS algorithms (Theis et ah, 2003 Vrins et ah, 2007) 


Overall, the contrasts giving more accuracy at low computation, blind and without local 
minima are still in demand and open for further research. 

To overcome this demand, the latest trend in BSS contrasts follows two directions. 

1. Other than the conventional Shanon entropy and KLD as a divergence measure be¬ 
tween two PDFs, there exists many alternative definitions and interpretations of en¬ 


tropy, PDF distance measures and independence interpretations (Ullah 1996. Principe 


2010; Seth et ah, 2011). Inspired by the above motivations, the research community 


1. The formal definition is in 


Section [H] 
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has started focusing on these alternatives to derive new BSS contrasts (Bach and 


Jordan, 2003 Learned-Miller and John III, 2003). 


2. The new independence interpretation should be incorporated with kernel based fast 
and nonparametric estimation technique to derive new BSS contrast. 

Combining both the above directions, the latest trend is to use quadratic measures of 


independence for BSS. The article by Achard et al. (2003) uses L 2 distance between the 


transformed characteristic functions of joint and product of the marginal PDFs. The Infor¬ 
mation Theoretic Learning (ITL) suggests many such quadratic independence measures, for 
example, Renyi’s Entropy, Cross Information Potential (CIP), Euclidean distance ( Ded ) 
based and Cauchy-Schwarz distance ( Dcs ) based Quadrature Mutual Information (QMI) 
(Principe, 2010; Kenneth E. Hild and Prncipe, 2001). The article by Seth et al. ( 201 1[ ) 


provides ITL based unified framework to those quadratic distance measures and proposes 
a new parameter free distance measure for ICA. 

The current article is inspired by all the above motivations and follows the latest trend. 
It derives some new independence interpretations relating gradient of the PDFs, specifically 
for bounded support random variables, and proposes new BSS contrasts. It achieves their 
nonparametric estimation with reduced computation by using least squares based direct 
estimation approach. The suitable choice of a kernel bandwidth parameter using data de¬ 


pendent bandwidth selection Extended Rule-of- Thumb by (Dharmani Bhaveshkumar 2015) 
achieves a parameter free contrast estimation. 

There have been proved some results for generalized differentiable multivariate functions. 
Looking PDFs as a generalized functions, the results are applied on independence of random 
vectors. The results are: 1) The equality of the gradient of joint probability density function 
(PDF) and the gradient of product of the marginal PDFs imply independence. 2) The 
equality of the Hessian of joint PDF and the Hessian of product of the marginal PDFs 
imply independence, if the prior given that the random vector has bounded support i.e. its 
probability outside certain region is zero. These new independence interpretations are used 
to derive new independence measures and contrast functions for BSS. The bounded support 
condition is not very restricting. The reasons are: 1) Empirically, the sampled region is 
always bounded. 2) Numerically, the computers always work with definite range. Though 
may not be always, it might be a valid approach in most cases to blindly consider PDF 
outside the bounded sampled region to be zero. To achieve nonparametric estimation of 
the newly derived contrasts, there has been derived single stage direct estimation method 
using least squares. To take the advantage of the quadratic nature of the contrasts, there 
are defined concepts of Reference Information Potential (RIP) and Cross RIP (CRIP) that 
depend upon IP due to selected kernel basis. The concepts are used to achieve closed form 
expressions for information field analysis. The derived closed form expression are verified 
by applying them to obtain L 2 -Norm of FD and L 2 -Norm of GFD contrasts. The method 
uses Gaussian kernels as basis and has two variations. One, the basis are placed at the 
selected paired sample points only. Another, the basis are placed at selected sample points 
may be paired or unpaired. 

The next Section [2] derives some results for generalized multivariate differentiable func¬ 
tions with bounded support. The results are applied to statistical independence condition 
in Section [3j To better exploit the results, it derives new definitions and their important 
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properties. Corresponding to that, the new independence measures are derived in Section]!} 
The Section [5] briefs the BSS problem and the possible approach for solution. The previous 
results are used to derive new BSS contrasts; satisfying the important properties of Scale 
invariance, Dominance and Discrimination; in Section [6]. There is also done local minima 
analysis of the derived contrast. The next Section [7] discusses the contrast function estima¬ 
tion approaches and derives prerequisites of Kernel Theory and Information Potential (IP). 
The Section [8] defines the Reference IP (RIP) and related concepts. Then, the Section [9] 
derives the least squares based closed form expression for information field analysis. The 
expressions are used to derive FD based estimators LSFD and LSFD2 in Section 10 and 


GFD based estimators LSGFD and LSGFD2 in Section 11 The Section 12 reports empiri 


cal verification of the derived independence measures and BSS contrasts. The Section 12.1.1 


provides important discussion on required parameter selection for the derived estimators. 


Finally, the article ends with conclusion in Section 13 


2. Some Results On the Equality of Generalized Constrained 
Multivariate Functions 

Definition 1 A function f : M n -» /(M n ) is said to have support IZ if /(x) = 0,Vx 6 1Z', 
where, 1Z C and 1Z' is its complement set. It is represented as supp(f) = 1Z. Any 
superset oflZ is also a support. If TZ is bounded above and bounded below then f is a said 
to be a bounded support function. 

Let Conv(Tvl) be the convex hull of IZ that contains all convex combinations of points in IZ. 
Then, the definition says that for the bounded support functions both the support IZ and its 
convex hull Conv(77) have finite measures. If IZ is convex, both the support measure ( l(IZ )) 
and its range (l(Conv(IZ))) are same, where l is the length of an interval. For example: let 
IZ = [—1,1], Then, the support measure l(IZ ) and the range Z(Conv(7£)) = 2. Now, let 
IZ = [—1,1] |J(2,4] \ 3. Then, 1{IZ) is 4. But, the Conv(7£) is [—1,4] and Z(Conv(7vl)) = 5. 

For differentiable multivariate functions with equal hyper volumes (region bounded by 
hyper surfaces) the following results are derived. For some of the results, an added constraint 
of random vector having bounded support is required. 

Theorem 2 Let f : W 1 M n —> M and both satisfy the following conditions: 

1. They are differentiable. 

2. J Rn f (x)dx = f Rn g(x)dx, x G K n 
Then, the following holds: 


V/(x) = Vg(x) => /(x) = g(x) (1) 

Proof Let us prove this Theorem by mathematical induction. 

The Base Case: n = 1 

Given f( x ) dx = Iru g(x)dx and £f(x) = ^ g{x ). 

Integrating both the sides of the latter equation leads to, 

f{x) = g{x) + c (2) 


4 







where, c is some arbitrary constant. 

Integrating both the sides of Equation ([2]) with respect to x from —00 to 00 , brings: 



This proves the Theorem for the base case. 

The induction step: 

Given the Theorem holds for n = k, let us prove it for n = k + 1. 

For the sake of simplicity and without the loss of generality, let us prove it for n = 3 
assuming it holds for n = 2 i.e. for k = 2. 

Accordingly, let x = (x\, x 2 , x 3 ) T . 

Given, f Rn f(x)dx = f Rn g(x)dx and V/(x) = Vg(x). 

As per the latter equation, g|^-/(x) = g(x ). 

Integrating both the sides with respect to x\ leads to: 

f(x)=g(x) + c(x 2 ,x 3 ) (3) 

where, c(x 2 , x 3 ) is some arbitrary function of x 2 and x 3 . 

Taking partial derivative of Equation ^ with respect to x 2 , we get: j^c(x 2 , x 3 ) = 0 
Taking partial derivative of Equation ([ 3 ]) with respect to x 3 , we get: jf^cfx 2 , x 3 ) = 0 
Combining the results on c(x 2 ,x 3 ), we get: c(x 2 ,x 3 ) = 0 
This proves the Theorem for n = k + 1. 

Combining both the base case and inductive step, by mathematical induction, the Theorem 
holds for all natural n. ■ 

Lemma 3 Let f : —> M, g : M n —> K and both satisfy the following conditions: 

1. They are second order differentiable. 

2- J n f(x)dx = j'. R g(x)dx, xeK", K = supp(f) |J supp{g) 

3. They have bounded support. 

Then, the following holds: 

V 2 / (x) = V 2 5 (x) =» /(x) = g(x) (4) 

Proof Let us prove this Lemma by mathematical induction. 

The Base Case: n = 1 

Without loss of generality, let I = [—a, a] A Conv(7£), a E M 

Given J I f(x)dx = /j g(x)dx and 4^f(x) = 4^g{x ).Double integrating both the sides of 
latter equation with respect to x leads to, 

f(x) = g(x) + crx + c 2 (5) 

where, ci and c 2 are some arbitrary constant. 

Integrating both the sides of Equation ([ 5 ]) with respect to x from —a to a, brings c 2 = 0. 
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Integrating both the sides of Equation ([5]) with respect to x from —a to b, b > a,, b E M 
brings c\ = 0. 

This proves the Lemma for the base case. 

The induction step: 

Given the Lemma holds for n = k, let us prove it for n = k + 1. 

For the sake of simplicity, let us prove it for k = 2 i.e. n = 3, assuming it holds for n = 2. 
Its generalization to k > 2 is obvious. 

Without loss of generality, let x = (xi, X 2 , x 3 ) T and I = [—a, a] 3 D Conv(77), a E M 
Given f I f(x)dx = J I g(x)dx ; V 2 /(x) = V 2 g(x). 

=4- -/(x) = 5 (x). Integrating twice both the sides with respect to xi leads to: 

/(x) = g(x) + ci(x 2 , x 3 )xi + c 2 (x 2 , x 3 ) (6) 


where, ci(x 2 ,X 3 ) and c 2 (x 2 ,X 3 ) are some arbitrary functions of x 2 and x 3 . 
Integrating Equation ( 6 ) over I, we get: f j X2 c 2 (x 2 , X 3 )dx 2 dx 3 = 0 
Integrating Equation ( 6 ) with respect to xi from —a to a, we get: 


fi(x 2 ,x 3 ) = gi (x 2 , x 3 ) + 2ac 2 (x 2 ,x 3 ) (7) 

where, /i(x 2 ,x 3 ) = /“ a /(x)dx 1 and fifi(x 2 ,x 3 ) = j\g(x)dx 1 . 

Integrating Equation 0 with respect to both x 2 and X 3 , we get: J' );j f i(x 2 , x 3 )dx 2 dx 3 = 

fx 3 1 x 2 9 i(x 2 ,x 3 )dx 2 dx 3 

Integrating ^/(x) = gf^(x) with respect to xi from -a to a, we get: /i(x 2 , x 3 ) = 

gfrsi (* 2 ,ss) 

Integrating ^r/(x) = ^rg(x) with respect to xi from -a to a, we get: -£^zfi(x 2 ,x 3 ) = 
^z9i(x 2 ,x 3 ) 

Applying, n = 2 case, with all conditions satisfied, we get: /i(x 2 ,x 3 ) = < 7 i(x 2 ,x 3 ) 
Therefore, from Equation Q, c 2 (x 2 ,x 3 ) = 0. 

Integrating the Equation ® with respect to xi from —a to b, b > a, b E M, we get: 
ci (x 2 , x 3 ) = 0 

This proves the Lemma for n = k + 1. 

Combining both the base case and inductive step, by mathematical induction, the Lemma 
for all natural n. ■ 


Lemma 4 Let f : —> M, g : M n —> M and both satisfy the following conditions: 

1. They are p th order differentiable. 

2- f n f{x)dx = J n g(x)dx, x E M n , 77 = supp(f) |J supp(g) 

3. They have bounded support. 

Then, the following holds: 


V p /(x) = V p g(x) => /(x) = g(x) 


( 8 ) 
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Proof The Theorem [2] proves this for p = 1 and the Lemma [3] proves the same for p = 2. 
Here, it needs be proved for any p > 2. 

Let us prove this Lemma by mathematical induction. 

The Base Case: n = 1 

Without loss of generality, let I = [— a±, ai] D Conv(77), a\ £ M 
Given f I f(x)dx = j l g(x)dx and £pf(x) = £pg(x). 

Integrating p times both the sides of latter equation with respect to x leads to, 

f(x) = g(x) + cix p_1 + c 2 x p ~ 2 4- Vc p (9) 

where, Cj, i = { 1 , 2 ,... ,p} are some arbitrary constant. 

We can have two cases: Let p be even. 

Integrating both the sides of Equation Q with respect to x from — a± to oi, brings anC 2 + 
01204 + ... + a\ q Cp = 0 , where q = p/2 and ai*s are the coefficients as a result of integration. 
Let there be q — 1 real numbers ai, i = {2,3,... q} such that ai > a± and each one is different 
from the other. Then, integrating (|9j) with respect to x from — ai to ai, brings over all q 
equations with coefficients ap, i = {1, 2 ,... ,q},j = {1, 2,..., < 7 }. Representing them in a 
matrix form, Ac = 0, where A = [ap], V ap / 0 and c = (c 2 , C 4 ,..., c p ) T . The only solution 
to this equation is: c* = 0 ,i = {2,4 ,p} i.e. all c ,;,i = V even in Equation [ 9 ] are zero. 
Now, let there be q real numbers bi > at,i = {1,2,... ,q} such that none of them is equal 
to the other. Integrating both the sides of Equation Q with respect to x from —a* to 
brings 6 ,;iCi + 6^03 + ... + bi q c p -\ = 0 , where q = p/2 and bij,j = {1,2,... ,q} are the 
coefficients as a result of integration. In a matrix form, Be = 0, where B = [bp] and 
c = (ci, C 3 ,..., Cp-i) 1 . This brings all c t , i = odd also to be zero. 

This proves the lemma from Equation ([ 9 ]) for p even case. 

The p odd case can also be solved similarly. 

This proves the Lemma for the base case. 

The induction step: 

Given the Lemma holds for n = k, let us prove it for n = k + 1. 

For the sake of simplicity, let us prove it for k = 2 i.e. n = 3, assuming it holds for n = 2. 
Its generalization to k > 2 is obvious. 

Without loss of generality, let x = (aq, x 2 , x?/) T and I = [—ai,ai ] 3 N Conv(7£),a E M 
Given f I f(x)dx = f I g(x)dx; V p /(x) = V p g(x). 

=P gf^/(x) = J^g(x). Integrating p times both the sides with respect to x\ leads to: 

f(x) = g(x) + c 1 (x 2 ,x 3 )xf ~ 1 + c 2 (x 2 ,x 3 )x p 1 ~ 2 + • •. + c P (x 2 ,x 3 ) ( 10 ) 


where, Ci(x 2 , x 3 ),i = {1,2 ,... ,p} are some arbitrary functions of x 2 and x 3 . 

Let p be even. 

Integrating Equation (10) over I, we get: 

f x 3 f X2 {anc 2 (x 2 ,x 3 )x p r + ai 2 cYx 2 , x 3 )x P Y 4 + ■ ■ ■ + a lq c p (x 2 ,x 3 )}dx 2 dx 3 = 0 
where q = p/2 and an are the relevant coefficients. 

Integrating Equation (10) with respect to x\ from — a\ to 01 , we get: 

fi(x 2 ,x 3 ) = gi(x 2 ,x 3 ) + a u c 2 (x 2 ,x 3 )x p 1 ~ 2 + ai 2 C 4 (x 2 , x 3 )x p f 4 + ... + ai q c p (x 2 ,x 3 ) ( 11 ) 
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where, fi(x 2 ,x 3 ) = f“ f(x)dx 1 and gi{x 2 ,x 3 ) = f“ a g(x)dx i. 

Integrating Equation (11) with respect to both x 2 and x 3 , we get: J x _ s f x> f\(x 2 , x 3 )dx 2 dx 3 = 
1 x 3 L 2 9 i(x 2 ,x 3 )dx 2 dx 3 

Integrating gfp/(x) = gf^rg(x) with respect to x x from -ai to ai, we get: ^zfi{x 2 ,x 3 ) = 
gi{x 2 ,x 3 ) 

Integrating gf^j/(x) = gfLrg(x) with respect to x\ from -ai to a±, we get: gf^/i(x 2 , x 3 ) = 

Z&5i(z2,x 3 ) 

Applying, n = 2 case, with all conditions satisfied, we get: fi(x 2 ,x 3 ) = gi(x 2 ,x 3 ) 
Therefore, from Equation © , a n c 2 (x 2 ,x 3 )x{ 2 +ai 2 c±(x 2 , x 3 )x p l 4 +.. .+a lq c p (x 2 , x 3 ) = 0 . 
Similar to the n = 1 case, we can form q — 1 such other independent equations, solving 
them we get: c* = 0 .\/i even 

Integrating the Equation (10) with respect to x\ from —a 3 to 61 , b\ > ai,&i G M, we get: 

&iici(x 2 ,x 3 )^ _1 + 6i2C 3 (x 2 ,x 3 )x^ 3 + ... + aigCp_l(x2,X 3 ).Tl = 0 

Similar to the n = 1 case, we can form q — 1 such other independent equations, solving 

them we get: c* = O.Vi odd 

This proves the Lemma for n = k + 1. 

Combining both the base case and inductive step, by mathematical induction, the Lemma 
for all natural n. ■ 


Lemma 5 Let f : —> M, g : M n —> M and both satisfy the following conditions: 

1. They are p th order differentiable. 

2 - In f( x ) dx = In9( x ) dx ^ X G M n , n = supp{f) U supp(g) 

3. They have bounded support. 

Then, the following holds: 

fix) = g{x) & V p f (x) = V^(x) (12) 

Proof Given /(x) and g(x) are differentiable: /(x) = g{x) => V p /(x) = X7 p g(x ) 

The converse part is proved in Lemma |4j This proves the current Lemma. ■ 

Theorem 6 Let f : M n —> R, g : W 1 —> M and both satisfy the following conditions: 

1. They are p th order differentiable. 

2 - In f( x ) dx = Ind( x ) dx ^ x G RTl ’ 7Z = su PP(f ) U supp(g) 

3. They have bounded support. 

Then, the following holds: 


f{x) = g{x) & Vfix) = Vg(x) <S> ... V p /(x) = V p gix) 


(13) 












Proof Applying principle of transitivity of implication (Hypothetical syllogism) to Lemma 
[5] with varying values of p, this Theorem is proved. ■ 

For a generalized functions, given any p th order derivatives are equal, the only available 
information would be that the functions differ by a constant in their (p— l) th order deriva¬ 
tive. It would require p initial conditions to decide about equality of the functions. The 
Theorem [2] proves that if the given condition for p = 1 is added with one more condition of 
equal hypervolumes then it brings equality of the functions. The above Theorem [6] proves 
further the strength of an added prior information that the function is also having bounded 
support. This prior implies that any p th order derivatives are equal, the functions are equal. 
Conversely, given two functions with equal p th derivative are not equal imply either of the 
conditions are not matching. For example; let f(x) and g{x) are constant functions with 
unequal constant values and unequal supports on real line such that area under them are 
same. The derivatives are same and zero everywhere. The example seems counterexample 
of the Theorem [2] as both derivatives are same but not the functions. More better observa¬ 
tion clears that both the functions are discontinuous at boundary points. This violates the 
differentiability condition of Theorem. The derivative values given zero, actually excludes 
points with Lebesgue measure zero. 


3. Applications of the Results On Independence 


By definition, the probability density functions have area under the curve to be unity. The 
bounded support function assumption seems restricting application to many PDFs. But, 
as said in the Section [lj empirically and numerically this assumption is not restricting. So, 
it is natural to think of extending the previous results to independence condition. Looking 
similarity with the results on Score Function Difference (SFD) and its properties related to 
independence in (Babaie-Zadeh 20 septembre 2002), the topic is developed using matching 
terminology. 

Let x = (xi,X 2 , • ■ ■, x n ) T is an n-dimensional random vector, where, Xj, i = 1,2,..., n are 
random variables; p x (xi,X 2 ,... , x n ) is the joint PDF of x and fl/Li Px l ( x i) is the product 


of the marginal PDFs. For this description, the statistical independence as in (Papoulis 


1991), and other terms are defined. 


Definition 7 (Statistical Independence) The random variables X\,X 2 , ■ ■ ■ ,x n are said 
to be statistically independent, if 


n 

Px(x 1 ,X 2 , • ■ ■ ,x„) = W_P Xi (.Xi) 

2=1 

As the stastical independence finds many applications, it is worth defining the following 
term. 


Definition 8 (Function Difference (FD)) The Function Difference (FD) ofx is the dif¬ 
ference between product of its marginal PDFs n” = i Pxi{xi) and its joint PDF p x (x 1 , X 2 ,..., x n ), 
that is: 

n 

A(x) d = \\PxiiXi) ~p x (x l,X 2 , • • • ,X n ) 

2=1 
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From the Definition, A(x) = 0 implies independence. 

With the assumption that the joint PDF and the marginal PDFs are differentiable, the 
followings are defined. 

Definition 9 (GPF) The Gradient of the Product Function (GPF) ofx is the gradient of 
the product of the marginal PDFs YYl = iPxi{xi), that is: 


C(x) 


d g v 


Y[p Xi (Xi) 


\i=1 


where, fi(xi) 


def 


_d_ 

dxi 


= (£l(Zl), 6(^2), • • • ,f,n(x n )) T 


Definition 10 (GJF) The Gradient of the Joint Function (GJF) of x is the gradient of 
the joint PDF p x (x i,x 2 ,..., x n ), that is: 


def 


C(x) = Vp x (xi,X2,...,X n ) = (Cl(x),C2(x),...,Cn(x)) J 

where, Q{x) d = -J^-pgxi, x 2 , ...,x n ) 
oxi 


Definition 11 (GFD) The Gradient Function Difference (GFD) of x is the difference 
between its GPF and GJF or equivalently it is the gradient of FD, that is: 


«(x) = £(x) - C(x) = V(A(x)) 


The following property proves that GFD («(•)) contains important information about in¬ 
dependence of the components of a random vector. 

Property 1 The components of a random vector x = {x\, x 2 , ■ ■ ■, x n ) T are independent if 
and only if ct(x) = 0, that is: 

*(x) = C(x) (14) 

Proof Let us take for better analogy, /(x) = p x (^i)^ 2 , ■ ■ • ,%n) and g(x) = n£i P Xi (%i) 

=> f K f(x)dx = f n g(x)dx = 1 

Given £(x) = £(x) or V/(x) = 'Vg(x). 

The required conditions are satisfied. So, applying Theorem [2] and the Definition [7] of 
Independence, the property is proved. ■ 

For the same random vector x with added assumptions that the joint PDF and the product 
of the marginal PDFs are both second order differentiable and have bounded support, the 
following definitions and results are obtained. 

Definition 12 (HPF) The Hessian of the Product Function (HPF) of x is the Hessian of 
the product of the marginal PDFs YYi^Px^i), that is: 

E(x) d = V£(x) = V 2 
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Definition 13 (HJF) The Hessian of the Joint Function (HJF) o/x is the Hessian of the 
joint PDF Px(xi,X 2 , • • •, x n ), that is: 

Z(x) d = VC(x) = V 2 p x (xi,x 2 , • • • ,x n ) 

Definition 14 (HFD) The Hessian Function Difference (HFD) of x is f/ie difference be¬ 
tween its HPF and HJF, or equivalently it is the Hessian of FD, that is: 

A(x) d = S(x) - Z(x) = V (£(x) - C(x)) 

= V 2 (A(x)) = Vq(x) 


The following property proves that HFD (A(-)) contains important information about in¬ 
dependence of the components of a random vector. 

Property 2 The components of a bounded support random vector x = (x\,X 2 , ■ ■ ■ ,x n ) T 
are independent if and only if A(x) = 0, that is: 


E(x) = Z(x) 


Proof Applying Lemma [4] with p = 2, the property is proved. 


(15) 


Corollary 15 Letx = (x\, X 2 , ■ ■ ■, x n ) T be an n-dimensional random vector; p x (x\, X 2 , ■■-,x n ) 
be its joint PDF; n )=i Pxi(xi) be its product of the marginal PDF; the PDFs be second order 
differentiable with bounded support 1Z C M n . Then: 

n 

Px(xi,x 2 , ...,x n ) = Y\p Xi {xi) 44 £(x) = C(x) 44 S(x) = Z(x) (16) 

i =1 

Proof Applying Theorem [6] and the Definition [7] of independence, the corollary is proved. 


The Property [l] of GFD, Property [2] of HFD and the Corollary 15 bring further interpre¬ 
tations on independence of bounded support random vector. Our goal is to develop new 
contrasts based on them. For that the quantities should be nonnegative to be quantified as 
measures. So, first let there be derived independence measures based on these results. 


4. Deriving new Independence Measures 

The goal here is to derive independence measures based on the quantities FD, GFD and 
HFD. But, the quantities do not assure nonnegativity to be quantified as measures. Assum¬ 
ing a class of L p integrable PDFs, the L p norm can be applied on them. Being norm, they 
satisfy all the properties of a metric and an added property of absolute scale invariance, as 
per the definition of norm. The details on the definitions of a measure, a metric, a norm 
and the specific L p -norm are briefed in Appendix [A} 

It is desired that a distance measure between PDFs is invariant with respect to 
translation and scaling i.e. the deviation in mean and the variance should not affect the 
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distance measure. The reason is, the nearness of the PDFs should imply their shapes are 
matching. The desired property of scale invariance, instead of the absolute scale invariance, 
can be assured by defining an independence measure that applies a norm on normalized 
PDFs i.e. converting them first into zero mean, univariance PDFs. 


Proposition 16 For a random vector x E with L p integrable joint and marginal PDFs, 
LpFDfx) or A p (x) defined as under is an independence measure. 


A p (x) d = ||A(z)|| p =( f |A(z)| p dz\ 
\JR n J 



(17) 

(18) 


where, z = x and x. L are consecutively mean of x and Xi, <r x and a Xi are corre¬ 

sponding standard deviations. 


Proof By definition, A p (x) > 0 and A p (x) = 0 if and only if A(x) = 0. Also, by Defini¬ 
tion [7] of independence, A p (x) = 0 if and only if the components of x are independent. 
This proves that A p (x) is an independence measure. More specifically, it is an indepen¬ 
dence metric with respect to A(x), but not necessarily on the space of random vectors x 
themselves. ■ 


The GFD is essentially a vector, whose value is an n-tuple of functions. Accordingly, 
a:L ? xL p x...l7->K. So, L p -norm can still be applied as under. 


Proposition 17 For an n-dimensional random vector x = {x\, X 2 , ..., x n ) with differen¬ 
tiable joint and marginal PDFs, LpGFDfe) or a p (x) defined as under is an independence 
measure. 


a , 


(x) d = ||a(z)|| p = ( ^(||ai(z)|| p ) p 


^ 2—1 


or d p (ffx), C(x)) 




(19) 

( 20 ) 


where, z 



x is the mean of x and cr x is the corresponding standard deviation. 


Proof The differentiable PDF condition, assures L p integrability. 

By definition, a p (x) > 0 and a p (x) = 0 if and only if a(x) = 0. Applying Property [I] 
a p (x) = 0 if and only if the components of x are independent. 

This proves that cx p (x) is an independence measure. More specifically, it is an indepen¬ 
dence metric with respect to a(x), but not necessarily on the space of random vectors x 
themselves. ■ 


The HFD is essentially a matrix. So, matrix norms are applicable. The ‘ Entrywise’ norms 
treat matrix entries as a vector entries. The following independence measure can be defined. 
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Proposition 18 For a bounded support random vector x = (x\, xi, xn) with second 
order differentiable joint and marginal PDFs, LpHFDfx.) or A p is an independence measure, 
where: 


A p — ||j4(x)|| p — |££(||A;(x) 

j=\ 1=1 


or 


dp (S(x), Z(x)) 

j =1 i =1 ^ Xi i 






X — X \ /x — X 

— A 






dx 




( 21 ) 

( 22 ) 


where, x is the mean of x and cr x is the corresponding standard deviation. 

Proof The second order differentiable PDF condition, assures L p integrability. 

By definition, A p > 0 and A p (x) = 0 if and only if A(x) = 0. Applying property[2j A p = 0 
if and only if the components of x are independent. 

This proves that A p is an independence measure. More specifically, it is an independence 
metric with respect to A(x), but not necessarily on the space of random vectors x them¬ 
selves. ■ 


5. The Linear BSS Problem and Solution 

The Blind Source Separation (BSS) model explains generation of an observed random 
vector x(t), as an transformation to another latent (hidden) random vector s (t). As¬ 
suming linear and instantaneous transformation, mathematically, x(f) = As (t), where 
x(i) = [xi (t);x 2 (t);...-,x m (t)]; s (t) = [si(t); s 2 (t );...; s n (t)]; Xi(t), Sj(t) are random vari¬ 
ables with values in 1Z; m = n >= 2 and A is full rank. Let there be available N umber of 
samples of each observed random variable. Assuming an identicle distribution, the instan¬ 
taneous model can be extended for N realizations. Let X(t) = [xi(f); X 2 (t);... ;x m (f)] be 
the m x N data or observation matrix and S(f) be the n x N component or source matrix. 
Then, 

X(f) = AS (t) (23) 

The problem of BSS is to estimate both the unknowns A and S(f), with the only assumption 
of s i(t) being mutually the most independent possible (m.i.p.) random variables with respect 
to a given contrast. If W is the estimated inverse of the mixing matrix A then the estimated 
source or component matrix Y(i) is: 

Y(t) = A _1 X(f) = WX(f) = WAS(t) (24) 


As, X(t) = AS (t) = (AA _1 P^ 1 )(PAS(t)), for any permutation matrix P and a scaling 
matrix A, there are going to be scaling and permutation ambiguities in the estimated 
components. 

Given the unknown sources are independent and identically distributed ( i.i.d.) with 
maximum one of them being Gaussian, a unique BSS solution is assured by Darmois- 
Skitovtch Theorem (Comon and Jutten, 2010 Comon 1994 Eriksson and Koivunen, 2004). 
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Accordingly, the BSS solution for linear, instantaneous mixing system can be obtained by 
maximizing the independence among yfit )s with respect to the separation matrix W, as: 

y *(t) = argmax <h(y(t)) (25) 

w 

where, 3>(y(t)) is the optimization function, based on independence or dependence measure, 
that assures source separation on maximization. It is identified as a contrast function or 
simply a ‘contrast’. Oerall, the BSS solution demands a suitable contrast function as an 
optimization criteria and a suitable optimization technique corresponding to that contrast 
function. 


6. Deriving New Contrasts for ICA and BSS 


A formal definition of contrasts, based on references (Cornon and Mourrain, 1996) and 
(Cornon and Jutten 2010, Chapter 3), for BSS is as under. 


Definition 19 (Contrast for BSS) LetP be a set of static transformations (filters) con¬ 
taining an identity transformation (filter) I; S be a set of source random variables that are 
independent and ; X = P ■ S be the set of random variables obtained by the action of P on 
S; be a mapping from P x P ■ S to R. Also, denoted by T the set of trivial filters ofP, 
which leave criterion <f> unchanged. A mapping <h(H;x) is a contrast if it depends solely on 
the PDF of x and if it satisfies the following three properties below. 


a. Invariance: Vx E A, VT E T, <b(T;x) = <f>(I;x) 

b. Dominance: Vs E S, VH E P, 4>(H; s) < 4>(I; s) 

c. Discrimination: Vs E S, if H E % satisfies 
$(H;s) = $(I;s), then H E T 


The Dominance property assures that the actual sources have the global maxima. The 
Discrimination property assures that there is no other spurious solution achieving the global 
maxima. There is some discussion needed on the invariance property. It is obvious that the 
independence components found using a given measure, are still independent if permuted 
or scaled. So, one of the solutions is available, whole class of solutions related through 
permutation and scaling operation is available. The Invariance property assures this by 
stating that whole class should have a same measure. The widely used KL-divergence 
assures this property. But, it is known that many other PDF divergence measures such 
as; Itakura-Saito distance, density-power divergences do not assure this scale invariance 
property. To accommodate such a larger class of divergences, without deteriorating the 
BSS performance, there has been first defined and then proposed relative scale invariance 
property as the sufficient property with other properties to be quantified as contrast. 

Definition 20 The contrast <& : P x P ■ S —>■ M is said to have relative Scale Invariance 
property; if it satisfies the following condition: Given y = Ax 


<I>(y) = k{ A)$(x) 

where, k{ A) is a fixed transformation as a function of A. 
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Proposition 21 Q: is a contrast for linear BSS, if it satisfies the Relative 

Scale Invariance property with other required properties satisfied. 

Proof The following arguments justify the proposition. 

• Given T £ T is a scale matrix with diagonal entries only. As the source components 
are independent, d*(s) = 0. From the definition of the relative scale invariance prop¬ 
erty, /c(T) is a predefined transformation acting as a scaling factor. So, <F(y) = 0. 
=> VT £ T, $(T; s) = <f>(I; s) = 0 

As per this argument, scale invariance is required corresponding to the source com¬ 
ponents s and not necessarily with respect to x. This is satisfied by the contrasts 
measuring 0 corresponding to independence and satisfying relative scale invariant. 

• By definition, the relation between the measures corresponding to x components and 
their scaled version Tx components is known. VT £ T,$(T;x) = A:(T)$(I;x) 

This assures the contrast measure for whole equivalence class of solutions are known. 

• For the most BSS algorithms or precisely the orthogonal approach BSS algorithms 
y = Wx, where W is the estimated unmixing orthogonal rotation transformation 
and x are the equivariant uncorrelated (whiten) components. This implies that the 
measure is applied on the solution set that is equally scaled. Mathematically, y = Ax, 
but A is a constant for the whole solution set. Also, corresponding k( A) is constant 
for the whole solution set. 


Though the relative scale invariance property is sufficient for a quantity to be a contrast, in 
most of the cases the quantity can be easily converted into a scale invariant quantity. This 
has been demonstrated for L p norm of FD, GFD and HFD distance measures. Now, let us 
verify whether the derived independence measures qualify to be a contrast or not. 

Proposition 22 $p D or$£ :H x H ■ S —> M is a contrast for linear BSS, where: 
<I>™(H;x) or (H;x) = $*(y) *= -A p (y) = -d p (\\_P yi {yi),P y {y) 

\i =1 


Proof Let us verify the scale invariance property of the contrast for both without and 
with normalization. Let T £ T be n x n diagonal scaling matrix, as a trivial filter, with 
the non-zero diagonal entries tj, i = 1,..., n. 


PTx(tlXl,t 2 X2, • • • , t n x n ) 

P(Tx)i((Tx)i) 

N 

^ H P x i 

i =1 




i 

]detT\ 


N 

n Px^i) 

i=i 
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A(y)|| P 


Now, (y) 


-Ap(y) = -|| 

- / 


i= 1 


dy 


1 


det T 


Px(x) - 


2=1 


— |detT| p A p (x) 


det T|dx 


1 

P 


This proves that the contrast <jA(y), without normalization of PDFs, is scale invariant 
for p = 1 i.e. corresponding to L 1 -norm of A. It assures relative scale invariance for 
1 < p < oo. As already discussed either the relative scale invariance is a sufficient condition 
or the measures are applied on normalized densities (i.e. densities with zero mean and unit 
variance) the scale invariance property is satisfied. Corresponding to normalized density, 
ti = 1, Vi = 1, 2,. .., n. 

The permutation invariance can be proved in a same way as | detTj = 1. 

The Proposition |17| proves the Dominance property. 

By Definition [7J A p (y) = 0 if and only if the components y = Hs are independent. So, H 
should be a trivial filter in T. This proves the Discrimination property. ■ 

Similarly, let us now verify whether the GFD is qualified to be a BSS contrast or not. 


Proposition 23 or d>“ : H x H ■ S 


is a contrast for linear BSS, where: 


^GFD { H;x) or <h“(H;x) = $“(y) = f -a p (y) = -d p (£ y (y), C y (y)) 


Proof Let us verify the scale invariance property of the contrast for both without and 
with normalization. Let T E T be n x n diagonal scaling matrix, as a trivial filter, with 
the non-zero diagonal entries ti,i = 1 ,,n. 

To simplify, let us start with the gradient of one dimensional transformed variable. 


Y = aX =>p Y (y) 
dp Y {y) 


dy 
dpvjy) 
dy 


dy 




dx 
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Let y 


=> T 


GFD 

p 


Tx. 



1 

P 


This proves, a p (y), without normalization, is neither scale invariant nor relative scale invari¬ 
ant. So, without normalization it can not be a BSS contrast, though being an independence 
measure. 

But, as already discussed the measures are applied on normalized densities i.e. densities 
with zero mean and unit variance, the scale invariance property is satisfied. Corresponding 
to normalization, ij = 1, Vi = 1, 2 ,..., n. 

The permutation invariance can be proved in a same way as | detT| = 1. 

The Proposition |17| proves the Dominance property. 

By Property [lj a p ( y) = 0 if and only if the components y = Hs are independent. So, H 
should be a trivial filter in T. This proves the Discrimination property. ■ 

Similarly, let us decide whether HFD - with and without normalization is qualified to be a 
BSS contrast or not. 


Proposition 24 or Tp : n x B ■ S — > M is a contrast for linear BSS of sources with 

bounded support, where: 

or (H; x) = $£(y) d = -A p (y) = -d p (H y (y), Z y (y)) 


Proof Let us verify the scale invariance property of the contrast for both without and 
with normalization. Let T 6 T be n x n diagonal scaling matrix, as a trivial filter, with 
the non-zero diagonal entries U,i = 1 ,,n. 

To simplify, let us start with the Hessian of one dimensional transformed variable. 


Y = aX => p Y {y) 


d 2 pY(y) 

dy 2 




d 2 pv{y) 
dy 2 


dy 


-px 





d JYM ix 

dx 
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Let y = Tx. 


a p ( y) = 



ED 

V i=l i =1 


11—3p 


IIA,-(x)|L 


1 

P 


This proves, A p (y), without normalization, is neither scale invariant nor relative scale in¬ 
variant. So, without normalization it is not a BSS contrast, though being an independence 
measure. 

But, as already discussed the measures are applied on normalized densities i.e. densities 
with zero mean and unit variance, the scale invariance property is satisfied. Corresponding 
to normalization, ti = 1, Vi = 1,2,..., n 

The permutation invariance can be proved in a same way as | detT| = 1. 

The Proposition |18| proves the Dominance property. 

By Property [2j A p (y) = 0 if and only if the components y = Hs are independent. So, H 
should be a trivial filter in T. This proves the Discrimination property. ■ 


6.1 Local Minima Analysis of the Proposed Contrasts 

The contrasts defined using L v -norm over FD, GFD and HFD have one more advantage 
that they do not have any local minima. This is a known property of L p -norm, p > 1, 
proved as under: 

^4-vnii/wii^pii/wir 1 

d\\f(x)\\ p 

r/ || JJ-VU ||/(z)||p = 0 => ||/0c)|| = o => f(x) = 0, Vx 

So, there is no separate proof required to show that the contrasts A p (y(8)), a p (y(8)) and 
A p {y(8)) do not have local minima with respect to the corresponding functions. But, still 
they may have local minima with respect to 9. Also, the estimation method may add local 
minima. Actually, it could be easily proved that the contrasts may contain local optima, as 
under. 


VAp(yo) = 0 

=> Ap(yo) = c (an arbitrary constant) 


Obviously, as only c = 0 imply independence, other values of c correspond to possible local 
optima. The more detailed analysis follows as under. 

Let x = (xi,X 2 , ■ ■ •, x n ) T be a bounded random vector and S = (Si ,..., S n ) T be a ‘small’ 
random vector. Then, the interest here is in the differential of <3?^ or || A(x + <5)|| —1| A(x)|| . 


A ( X + <5)II P - ||A(x)||„ 



+5i(x) -p x+5 (x) 


p 

dx 


n^(x)-p x (x) 


i =1 


V 

dx 
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Assuming t as the support of all the PDFs, 


|A(x + , 


ll p -||A(x)|| p = 




i —1 


dt — 




i=i 


dt 


= j\a — b\ p — \c — d\ p dt using symbolic notations 


where, a = ]Xi=i Pxi+Siit), b = p x+s ( t), c = YYLiPxii*) and d = p x { t). 

Let’s assume p = 1: 


|A(x + , 


ii - 11A (x) 11 x = 0 


Either / \a — b\ dt = / |c — 


J t J t 

or \a — b\ = \c — d\ , Vt 
or a = b and c = d. Vt 


d\ dt 


The condition f t |a — b\ dt = f \c — d\ dt do not assure gradient zero for optimal indicating 
independence condition. 

As per \a — b\ = |c — d\ , Vt, four different cases can be thought: 


Case I : a > b, c > d 
Case II: a > b, c < d 
Case III: a < b, c > d 
Case IV: a < b, c < d 


=4>a — b = c — d a — c = b — £ x (x) = Cx(x) 

=4>a — b = — c + d^>a + c = b + d^> spurious optima 
—a + b = c — d=^a + c = 6 + d=^ spurious optima 
—a + b = — c + d=^a — c = b — £ x (x) = Cx(x) 


The Case I and Case IV imply independence but not the other cases. 

The condition a = b and c = d, Vt also implies independence. 

Over all, the analysis implies that the contrast may have gradient zero indicating 
spurious maxima. 

Let’s assume p = 2: 


I|a(x + <s )|| 2 - ||a(x )|| 2 = o 

=^-Either / |a — &| 2 dt= / |c — d\ 2 dtdt 

J t J t 

or |a — b\ 2 = \c — d\ 2 , Vt 
or a = b and c = d, Vt 

The condition f t |a — b\ 2 dt = f t \c — d\ 2 dt do not assure gradient zero for optimal indicating 
independence condition. 

As per |a — b\ 2 = \c— d\ 2 , Vt => two different cases can be thought: 

Case l:a — b — c + d = 0 =$■ a — c = b — d =$■ £ x (x) = Cx(x) 

Case II: a — b + c — d = 0^a + c = b + d^> spurious optima 

The Case I imply independence but not the Case II. 

The condition a = b and c = d, Vt also implies independence. 
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Over all, the analysis implies that the contrast may have gradient zero indicating 
spurious maxima. 

Same way, for other values of p also, existence of spurious optima can be proved. 

Also, in a similar way, possible existence of local optima for contrasts <f>“ and can be 
proved. 


6.2 FD and its Stochastic Gradient 


The previous relation of FD, GFD and HFD reminds us the relationship between mutual 


information and the SFD. As proved by Babaie-Zadeh et al. (2004), SFD is the stochastic 


gradient and can be used to derive differential of mutual information. Also, it has been used 


to derive that mutual information has no local minima (Babaie-Zadeh and Jutten, 2005). 


So, it will be desired to investigate whether such results can be obtained with respect to 
FD, GFD and HFD. 

Let us try to obtain differential of FD, in terms of GFD as defined in Section |3j Let 
x = (xi,X 2 , • ■ ■, x n ) T be a random vector and 8 = (Si,..., 8 n ) T be a ‘small’ random vector. 
Then, the interest here is in the differential function of FD that is, A(x + 8) — A(x). 


A(x + <5) - A(x) = + <5i) -Px+< 5 ( x + <5)J - -p x (x) 

Assuming t as the support of all the PDFs, 

A(x + 8) - A(x) = ( flPxi+SiiU) -Px+*(t) ) - ( f[PxM) -Px(t) ) 


\i=l 


1 


Using Lemma 1 in (|Babaie-Zadeh et al. 2004), the following holds. 


d 


Px+«5(t) - Px(t) = - ^2 —{Es i {5i\x = t}p x (t)} + o(5) 


i =1 


= -E s {5 t Ux)} + o(6) 

Same can be applied to the product of the marginal PDFs, itself being a PDF. 

n n n Q n 

n pzi+SiQ) - n^c*) =_ yi + 


i =1 


11^^ y Z—/ f)t 

1=1 1=1 1=1 

= -E s {6 t Ux)} + o(5) 


(26) 

(27) 


(28) 


Combining Equation (27) and Equation (28), the differential function of FD can be given 

by, 


A(x + 6) - A(x) = -E 5 {8 T a x (x)} + o(5) 
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This is the differential function and to convert it into a number, let us simply integrate it 
over t. 


=>■ V f A x (x) = / (A x+<5 (t) - A x (t))dt = f E s {S T a^{x.)}dt + o(6) 
ix J t J t 

( A(x + 5) — A(x)) dx = f E${5 t cx x (x)}cix + o(5) = 5 T f a x (x) + o(5) 

J X J X 

A(x + (5) — A(x) 


V / A x (x) = lim 

./ x 5—>-0 


= / c* x (x)dx 


Similarly, => E{A(x + <5) — A(x)} = E{£)5{<5 T a: x (x)}} + o(<5) = d T E{a: x (x)} + o(5 ) 
=» V£'{A(x)} = E{a(x)} 


The above result proves that the GFD ( a ) serves as a stochastic gradient of the integrated 
Function Difference or expectation FD of a random vector. So, it could have been easier 
prove that A(x + 5) — A(x) = 0 cc(x) = 0 and that implies independence. But, the 
similar can not be proved for their corresponding l p measures i.e. A p (x + 6) — A p (x) = 
0 <£> a p (x) = 0 can not be proved. The reason is the contrast defined use the ZT-norm 
of FD and not just the integration or expectation of FD, as this quantities do not assure 
nonnegativity. So, the effort to prove that the contrasts are without local minima in the 
previous Section |6.1[ actually resulted into the proof for possible existence of spurious local 
optima for them. 

Overall, the contrast T“(y(0)) and T p (y(0)) do not have any local 

maxima with respect to itself. But, it may still have local maxima as a function of 9 (or 
some other variable), as y itself is a function of the search parameter 9. The next Section 
[7] focuses on the empirical estimation of these contrasts. 


7. Preliminary background on Estimation of the Derived Contrasts 


Usually, the independence measures avoid estimation of joint PDF, as higher dimension joint 
PDF estimation is less accurate or requires more samples than marginal PDF estimation. 


The article (Pham, 2003) notes that the measures based on estimation of joint PDF and 


marginal PDF both, try to cancel out estimation errors compare to the measures only 
estimating the marginal entropies. The minimization of L p -norm of FD, GFD and HFD 
are the BSS contrasts belong to this class of contrasts. The conventional way is to estimate 
them following a two stage process. In the first stage, separate estimation of joint PDF and 
marginal PDFs for <h^, their gradients for (/>“ and their Hessians for is achieved. Then, 
the second stage estimates their difference or ZT-norm. The separate estimation of the PDFs 
and their derivatives can be achieved through histogram based technique or kernel based 
method. The histogram based PDF estimation method is fast but less accurate compare to 
the kernel method. The estimation theory basics says that two stage estimation process for 
a required quantity amplifies the error in estimation. So, either separate estimation of joint 
and marginal PDFs and then their difference or the first joint PDF estimation, then based 
on it the marginal PDFs estimation and then the difference - this both way are indirect 
estimation method. Compare to them, the direct estimation of the required quantity from 
the data is supposed to be less erroneous. Though theoretically any real p > 1 is allowed, 
either p = 1 or p = 2 are more suitable for computation. The Kernel theory says that a 
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quantity based on the square of the PDF requires less computations than that based on 
PDF; if a Gaussian kernel is used. 

In general, compare to the estimation of PDFs, their derivatives and Hessians have more 
inaccuracies or require more samples for same precision. So, the article derives only the 
contrasts based on FD and GFD. In the light of these observations, there is proposed direct 
estimation of the L 2 based contrasts using ‘least squares’ approach. There are two different 
estimation approaches based on the sample locations selected to place the kernel basis. The 
first approach is to select the joint sample locations to place the multivariate kernel basis. 
The corresponding estimator for FD is identified as \&lsfd anc j ^hat for GFD is identified 
as ^ 2 SGFD ■ The methods require 0(b 2 ) computations, where b is the number of basis 
selected. The another approach places kernel basis at selected paired or un-paired sample 
locations. It requires 0(6 3 ) computations with better estimations. It is to be noted that the 
estimation of the same contrasts without the least square based approach requires 0(N 2 ) or 
0(N 3 ) order of computations where N is number of samples. Also, using Fast Gauss Trans¬ 
form (FGT) and Incomplete Cholskey Factorization the computational complexity can be 
further reduced. Similar methods are already in use for direct estimation of density differ- 


ence ( 

Sugiyama et al. 

2013b 

), density ratio (Sugiyama et al. 

2012; 

Yamada et al. 

2013) 

and squared loss mutual information ( 

Sugiyama et al., 2013a 

Sugiyama, 

2013; 

Sakai and 


Sugiyama ; 2014). The information potential due to such an arrangement of basis functions 


is identified as the Reference Information Potential (RIP). The article extends Information 
field theory to incorporate the new concepts of Reference Information Potential (RIP) and 
Cross-RIP (CRIP). The concepts are demonstrated, through above four estimators, to be 
useful to derive closed form expressions for information field analysis. 


7.1 Kernel Basics and Information Potential 

Given N realizations of an unknown PDF f(x), the kernel density estimate f(x ) is given by 


/(*) = 




(29) 


where, K(u) is the kernel function and h is the bandwidth parameter deciding the spread 
of the kernel. Usually, K(u) is a symmetric, positive definite and bounded function, i.e. it 
satisfies the following properties: 

/ oo /*oo /*oo 

K(u)du = 1, / uK(u)Du = 0, / u 2 K(u)du = ^(K) < oo 

-oo J —oo J —oo 

It is known that the convolution (symbol V) of two Gaussian functions is still a Gaussian 
function(G(-, •)). In a single dimension, 


G(v, cti) * G(u - v, <j 2 ) = G( u, \J(y\ + cr 2 ) 

Let us use this property to estimate the expectation of the square of PDF. Let the Gaussian 
kernel be given as, 


G a (x - m x ) = 


/ 


—oo 


exp 


_ 1 ! X—TTlx 

2 \ a I 


dx 


TUT 


22 







































Then, 


1 


N 


{f( x )~}dx = I ^ ^ G a (x - Xi) J dx 

N N 


/ °° 1 x _ 

—2 ^ ^ G a (X - Xi)G a (x - Xj)dx 
‘°° j= 1 i=1 

1 POO 

— 2 ^^/ G r7 (x - Xi)G a (x - Xj)dx 
■ i ■_i J —oo 


J=1 i=l - -°° 
iV AT 




JV 2 


j=l i=l 


Thus, the integration of the square of PDF is achieved in a computationally efficient way, 
avoiding the continuous integration. The ITL theory has given significance to this property 
by identifying it as a quadratic information potential. The details on IP, related indepen¬ 
dence measure QMIed and information forces follow in the Appendix [Bj 


8 . Extention to IP Theory 

One of the interpretations describes potential as the amount of work done required to bring 
a unit charge (for electric field) or unit mass (for gravity field) from infinity to the point 
in the force field, where infinity implies a point with zero potential. The particle contains 
amount of potential energy that has been applied to work against the force. It is customary 
in potential theory to think of a reference potential i.e. assuming that a particle is moved 
from a reference point in the field instead from the infinity. This helps analyzing a potential 
or gravitational field through a reference framework instead absolute. In a gravitational 
held theory, the potential energy at a hight from a sea level reference or some other local 
reference; in an electric held theory potential difference with respect to the common/neutral 
of the system or earth - are the respective examples. Moreover, during held analysis it is a 
general practice to start with a reference potential and then to express the required related 
quantities as a function of this reference potential. For example, in nodal analysis for 
electrical circuit analysis, reference potential is assumed at every node. 

Once defined IP, the natural query is whether it is possible to derive the concept of 
reference potential for information helds? Further, whether there can be derived some laws 
for information held analysis using the reference IP concept? The hrst question is answered 
defining RIP and related quantities in this section. The second question is answered in the 
next Section [9l 

8.1 Reference Information Potential (RIP) 

In kernel analysis, it is customary to initially assume a set of kernel basis placed at some 
selected sample points and then the required quantities are expressed as a function of 
the basis. The potential due to kernel basis can be identihed as a Reference Information 
Potential (RIP). Analogous with the laws in electric circuit analysis, the least squares like 
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approaches can be thought to bring functional relationship between a required quantity and 
the reference potential. 

Let 'f'(x') = {^(xi), ip(x2ip(xb)} be the set of kernel functions consecutively 
placed at selected sample location:^] Xj ,i = 1,2 ,b. They act as basis as potential at any 
point in the field is measured using linear combinations of them. . The selected sample 
points can be seen as the occurances of a random variable X^. Then, the potential of X^ 
(y a (X^)) is a Reference Information Potential (RIP). More specifically, quadratic RIP is 
the integral of the square of the PDF of X^ (p x . 0 (x) ), as under: 


RIP 2 = Vl = /pL (x)dx 


Vr = / Px, h ( x )dx 


j ^ (X - Xj)^ dx 


b b 


1 

6 2 

j=i *=i 
b b 


y - Xi)lp a (x - Xj)dx 


\yy [ i’a{x - Xi)4’ a (x - Xj)da 


b' 2 

j =1 i= 1 

= v 2 (X^) 

For a Gaussian kernel, ip(xi) = G(xj), the following holds: 


Vr 


1 

P 



X - Xi)G a (x 


j= 1 *=1 


Xj)dx 


The quadratic RIP definition can be generalized to a RIP, as: 


RIPa = d = [ipx^x)) a dx 

J X 

Once defined RIP, two more related concepts can be defined to bring the closed form 
expression for information field analysis. 


8.2 Cross Reference Information Potential (CRIP) 

The Cross Information Potential (CIP) is defined as the entropy of a PDF /(x) with respect 
to an another PDF g(x): CIP = E{f(x)} = f f(x)g(x)dx. With reference to the newly 
defined RIP concept, entropy of a PDF /(x) with respect to the reference PDF px 0 is called 

2. usually, the basis are placed at sample points but can be placed at some other points in the field 
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CRIP. The CRIP estimates the potential on the selected locations as the basis due to the 
interactions of locations from the sample space of f{x) (or vice versa). 


CRIP 2 


V 2 R (f) = V 2 (/,^) = j f(x)px 4 dx 

r 1 N x h 

V 2 (/,^) = J - x f {i))-^2lpa(x - x(j))dx 


1 

Nb 


b N r 

EE ip*(x - Xf(i))tp (7 (x - x{j))dx 

3=1 1=1 Jx 


For a Gaussian kernel, 'ib(x — Xi) = G(x — x t ) : then: 


Vr 


1 

m 

1 

Nb 



f (i))G a (x - x(j))dx 


J2J2 G ^V2( x f( i ) ~ x U)) 


8.3 Information Interaction Matrix (IIM) 

The analysis may not just require the final scalar outcome, but may depend upon the 
intermediate information interactions. So, let there be defined an Information Interaction 
Matrix (IIM) as the matrix due to each interaction. There can be IIM for potential, IIM for 
reference potential and IIM for information forces etc. For example, the held with N sample 
points will have N 2 interactions that will be the size of the IIM for potential. Similarly, the 
IIM for reference potential will be of dimension bxb and IIM for CRIP will be of dimension 
N xb. This is analogous to the Gram Matrix. Let us symbolize (V a (xi, xj )) as the potential 
on xj due to interaction with X{ and ~V a (X) as the IIM for the potential of random variable 
X. Also, V a (X) is already symbolized as the scalar quantity IP of X. Accordingly, V^t 
is the reference potential and Vr is the Reference potential IIM. In short, V(x(i), x(j)) = 
f il>(x,x(i))il>(x,x(j))dx, [\ 2 {X)} ij = V(x(i),x(j)) and V 2 (X) = J^jLi Ei=i[ v 2p0];.r 
Similarly, V(x/(i),x(j)) = f x i/j a {x - Xf(i))ip a (x - x(j))dx, [V 2 R ]ij = V{xf(i),x(j)) and 

V£(*) = ]&£}=! ELtV&r 


9. Closed-Form Expression using Reference Potential through Least 
Squares 

The section targets estimation of the contrast using closed form expressions, in terms 
of the RIP and related concepts. The mathematical expression is of type closed form if it 
requires finite number of constants, variables and operations. Instead of the conventional 
two stage approach, here, the estimation is achieved directly in a single stage through 
‘least squares’. Similar methods are already in use for direct estimation of density differ¬ 


ence (Sugiyama et al., 2013b), density ratio (Sugiyama et al., 2012; Yamada et al. 2013) 
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and squared loss mutual information (Sugiyama et al. 


2013a 

Sugiyama, 

2013; 

Sakai and 


Sugiyama, 2014). Also, it is worth noting that the LpFD or ||A(x )|2 independence mea¬ 


sure, for <I >2 ^ contrast, is same as the Euclidean Distance based Quadratic Independence 
Measure ( QIMed ) by Principe) (2010). 

The method of ‘least squares’ aims at estimating the model parameters that min¬ 
imize the sum of the squares of the errors between the true and the estimated quantity. 
Without loss of generality, let us use this approach to estimate A(-) of a two dimensional 
random vector and then the results be generalized to higher dimensions. The A(-) of a 
two-dimensional random vector is: 


A(x, y) := p xy {x, y) - p x (x)p y (y ) (30) 

Let g(x,y) = A (x,y) be the estimated A (x,y) and LSFD be the Least Squares based FD 
estimator. Then: 


LSFD = 


{g{x,y) - A (x,y)) 2 dxdy 


g{x,y) dxdy-2 / / g(x, y) A(x, y)dxdy + / / A (x,y) dxdy 
fyJx JyJx JyJx 

= V 2 (g(x,y)) — 2V(g(x, y), A(x, y)) (v the last term has no effect on LSFD) (31) 


where, V2(g(x,y)) = ||A||2 is the potential of the estimated A (a:, y) or the QIPed and 
V(g(x,y), A(x,y)) = || AA|| is the cross information potential between the actual and 

both quantities represent the required contrast. But, as proved by 
), the linear combination of them, the LSFD is more bias corrected 

Let us assume further that g(x, y) is given by a linear combination of the selected basis 
functions placed at the selected sample points. 

b 

a(x, y) = Y_16iipi{x, y) = 6{x, y) T ^(x, y) (32) 

Z=1 


estimated A (x, y). So, 


Sugiyama et al. (2013b 


estimator. 


where, b denotes the number of basis functions; 0(x,y) = (6i,6b, ...,0b) T is the parameter 
vector and T(x,y) = (Vt ^ 2 , •••, ^b) T is the basis function vector. So, with regularization 
function R(8) = 6 T 0 and A as the regularization parameter, 


LSFD {6) = 6 t V r 8 - 2h T 0 + A 8 T 6 
where, V R(6xfe) = J j V(x, y)^> T {x, y)dxdy 

hfcxi = j j ^(x,y)(p xy (x,y) - p x (x)p y (y))dxdy 

N 

hi = V R {il){x h yi), A) or [h }i = ^ [Vr(T(x^, yi), A) 7 ] t i 


( 33 ) 


2=1 
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The estimator depends upon the IIM for RIP and the IIM for CRIP of A(x, y). The optimal 
value of parameter vector 9(x,y ) can be obtained by minimizing the gradient of LSFD. 


9* 

3LSFD 

89 

9* 


argmin LSFD(0) 
e 

X R 9 + X9 - h 
(V fl + AI, 


(34) 


where, I& is a b-dimensional identity matrix. Thus, obtaining IIMs V# and V r (4/(xi, yi), A) 
gives the parameter vector ( 9 ), least squares estimator (LSFD), and V 2 (A) = 11A|1 2 = 
9 1 V r 9 all. This justifies the purpose to define the quantities RIP, CRIP and IIMs. 

10. $ 2 ° Estimation through Multiplicative Kernel Model 

Let us use multiplicative Gaussian kernel function as a basis function placed at the selected 
sample points. So, 


b 

g{x,y) = ^ 2°iK{x,Xi)L(y,yi ) = 9 1 [k(x) o %)] (35) 

i= 1 


where, K(x,Xi ) and L(y,yi) are the kernel functions at Xi and yt consecutively; 

k(x) = (. K(x,xi),K(x,x 2 ),...,K(x,x b )) T and l(y) = (L(y, yi), L(y, y 2 ),..., L(y, y b )) T are 

the kernel vectors and the operator o denotes Hadamard product. This gives 


[Vr (x,y)]ij 
=> v R (bxb)(x,y) 
or [V R (x,y)]ij 


j J K(x,x i )L(y,y i )K(x,x j )L(y,y j )dxdy 


Vr(x) o X R {y) 



(Xj - Xj ) 2 

4ct 2 


(Vi -Vj) 2 \ 
4a 2 J 


where, X R (x) is a 6x6 matrix with entries [Vi?(x)]y = K(x,Xi) * K(x,Xj ) and X R (y) is a 
6x6 matrix with entries [V R (y)]ij = L(y, yi) * L(y , yj ) and > 1 = is the symbol for convolution 
operation. 

The IIM for RIP Vr(6 X 6) 1 f° r an n-dimensional quantity, is obtained using 6” multi¬ 
plications. The computations can be reduced by replacing multiplications through additions 
of the exponents. This will require square of the n6 2 terms, additions of exponents and 
then taking exponents of 6 2 terms. Now, the sample estimate of h (h) can be obtained as 
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under: 


h 


(6xl) 


(k(x) o 1 (y))(p xy (x,y) — p x {x)p y (y))dxdy 


hi =h’ l - h'{ 

where, h, = j j ( K (x,xi)L(y,yi)) ( ^ ^ ( K (x, Xi)L(y, y*)) j dxdy 


i —1 


N 


N 


^2v R (xi, xi)V R (yi,yi) 


i =1 


AT 


1 ^ [ (Xi - Xi) 2 + {yi - yi) 2 

Wiv^ exp 


2=1 


4ct 2 

TV TV 


and h" = J J(K(x, x t )L(y, yi )) I ^2 *52 ( K ( x ’ Xi ) L( dJ' %')) J 


i=i *=i 


tv 


TV 


AT2 

1 

iv 2 


^ V^Xj,^) ^ ^V R {yj,yi ) 


i=l 


TV 


D=i 

TV 


^Vr(x;,xj)J ( XI V R(yj>yi) 

3 = 1 


V Z =1 


(36) 


The estimation of 6,) is obtained by replacing the Hadamard product through addition of 
the exponents, as for the estimation of V R (x,y). Each entry requires n — 1 additions, 
then exponent followed by N additions. Each entry /i" is obtained through nN additions 
and (n — 1) multiplication. So, over all vector h' requires (N + n — 1)6 additions and Nb 
number of exponents. The estimation of vector h" requires nNb additions and b(n — 1) 
multiplications. 


Once y R (x, y) and h are available, the linear coefficients (9) can be obtained solving 
Equation (34). The time complexity for this is 0(b 2 ). Based on Equation 31, the required 
= — V^A), the CIP(p XlX2 (xiX2),Px 1 (xi)Px 2 (, x 2)) and the least square approximation 
error (LSFD) are estimated. Also, the method estimates both the Function Difference (A) 
and $2 of a random vector simultaneously. The time complexity is usually measured in 
terms of the number of multiplications. With this the total multiplication time complex¬ 
ity is only 0(6 2 + b(N + n — 1)). It can be further reduced by taking exponent of values 
corresponding to (xi — Xj ) 2 < (3c) 2 or (y,; — yj) 2 < (3 a) 2 as zero. Though not the time 
complexity, the performance directly depends upon the number of samples available; specif¬ 
ically in higher dimensions. To effectively increase the available samples for estimation, the 
next section uses basis placed at both paired and unpaired samples to estimate A. The 
estimator is identified as LSFD2. 
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10.1 LSFD2 Estimation through Multiplicative Kernel Basis Placed at Paired 
and Un-paired Samples 

The estimation method places the multiplicative kernels as basis at unpaired samples also. 
This allows the use of Kronecker structure to reduce the computational cost. The approx¬ 
imation g(x, y) is defined as: 

b b 

g(x,y) = X/ X! Xi)L(y, yj) 

j =1 i =1 

= vec(0) T [(I{, <8> k(x)) o (1 (y) <8> I 6 )] 

where, 0 is a 6x6 parameter matrix, vec(-) is a vectorization function and <8> implies the 
Kronecker product. Accordingly, 

V R (i*(b-l)+j,k*(b-l) + l) = j J K(x,Xi)L(y,yj)K(x,x k )L(y,yi)dxdy 

^V R{b n xb ™)(x,y) = V R (y) (8 V r(x) (37) 

where, Vr(x) is a 6x6 matrix with entries [V^(x)]y = K(x,Xi ) * K(x,Xj ) and ~V R (y) is a 
bxb matrix with entries [V R (y)]ij = L(y,y t ) * L(y,yj). 

The sample estimate of h (h) can be obtained as under: 

h(b«xi) = J J [(I&® k(x)) o (l(y) ®lb)](p X y(x,y) - p x (x)p y (y))dxdy 

^ N ^ N N 

h(i*(b-i)+v) = v ( Xi ' s i)V(yi, y[) - ^ 

2=1 

Accordingly, h = h 7 — h" 
where, 


j =i i= l 


(38) 


/ _ / ^ vec (Vfl(®)Vfl(y)) , 


h 7 = 


for n = 2 


h" = 


\[h 7 ] i+(P _ 1) * b+( ,,,_ 1) * b 2 = ^ V(a?i, ari)V(j/i, for n = 3 

[h"l/4-m_n*h = vw ( yijli V(x ?; ,x/)) ( y/JL, V(yi,yi) ) for n = 2 


-"W-i)*» = s* (EiiVfe.x,)) (Ef.iV(w,»,')) 

. Efal v ( x h x l) Ejh j(f) Ehi V(2», 2,") for n = 3 


The equation of h 7 for n = 2 is not extendibles as it is to n > 2. The equations for n = 3 
show the way to get generalization for higher dimensions. As explained previously, the 
estimation of vector h 7 requires (N + n— 1 )b n additions and Nb n number of exponents. The 
estimation of vector h 77 requires b 2 N additions and n(b n ) multiplications. 


To estimate the optimal parameter matrix 0 is needed. The Equation (34) can be 
written as: 

Vfjvec(0) + Avec(0) = h 
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This is the famous discrete Sylvester equation and requires 0(b 3 ) computations to solve it. 
Now, the Equation (33) can be given as under: 


LSFD2 = vec(0) T (V ^{y) (g V^(.T))vec(0) — 2h 7 vec(0) 
= vec(@) T vec(Vii(x)@\'ii(y) T ) — 2h r vec(0) 

= trace(0 T VR(x)0VR(?/) T ) — 2h T vec(0) 


The computational complexity of above direct estimation of least square error in FD esti¬ 
mation is 0(b 3 ). Overall, the multiplicative kernel used as basis at sampled and unsampled 
pairs has 0(b 3 + (N + n)b n ) computations but expected to be more accurate specifically, 
with higher dimensions and less number of samples. 


10.2 A Note on the CRIP Estimations in above Derivations 

The least squares method has been used for direct estimation of density difference and 
density ratio, as it is mention in the Section |9j One could have noted that the way h or 
the V(x) has been calculated in this article is different from it is elsewhere. For example, 
calculation for hi = f ip(x, Xi)p x (x)dx in both the ways is demonstrated here. This article 
simplifies hi as under: 


hi 


G a (x, Xj)p x (x)dx 



' ^ N \ 

^' 52 G a (x,Xi)dx 


2=1 


1 

N 


N 

^2 G V2a( X ^ X j) 

2=1 


Intuitively, the kernel interacts with each sample of the PDF p x (x). The interaction is a 
convolution resulting into Gaussian with parameter \/2a. 

The other articles simplify hi as under: 


hi = G c 


f ( 1 N 

(x,xi)p x (x)dx = / G a {x,xj) f -^y^5(x,xj)d: 


— y G<t ( X j X i) 

3 = 1 


This is also correct, as hi can be thought as a convolution of the actual PDF with direct 
delta. 

Overall, both the approaches are correct. But, the first approach is more precise, as 
better approximates the PDF p x (x) through Gaussian kernel, than the delta kernel in the 
second approach. The empirical results also justify this approach. Actually, the Cross IP 
estimation in (Principe, 2010) follows the first approach though it seems conventionally the 
second approach is more popular. 
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11. ( 1>2 F,) Estimation 


The least squares approximation of a(x,y) = V(A (x,y)) and A(x,y) = V 2 (A (x,y)) can 
be achieved in the same way as that for A in the previous sections. In general, A ^ r \x,y) 
is the r th order derivative of A and can be estimated using the linear approximation 
g(x,y) := A ( \x,y) = d(x,y) T ^^(x,y). It is a customary approach to use multiplica¬ 
tive kernels for multivariate density estimation and then multiplicative derivative kernels 
for the derivative of multivariate density estimation. Accordingly, all major equations for 
A.( r \x, y) estimation remain as they are in A(x, y) estimation, with simply the basis \k(x, y) 
replaced by y). For example, with Gaussian kernel T(x,y) = Gh(x)oGh(y ) and 

ik^x, y) — G^\x)oG^\y) = h~ r H r (x)G(x)oh~ r H r (y)G(y). Let the least squares esti¬ 
mator for GFD be called LSGFD (Least Square GFD) and be derived as under: 

LSGFD = J j ( g(x , y) - a(x, y)fdxdy 

= V 2 (g(x,y)) — 2V(g(x, y), A(x, y)) (y the last term has no effect on LSFD) 

(39) 


where, Vi(g(x,y)) is the ||d|| 2 and V(g(x,y),a(x,y)) is ||da||. So, both the quantities 
represent the required contrast. But, as proved by Sugiyama et al. (2013b) the linear 


combination of them, LSGFD, is more bias corrected estimator. Also, let 

b 

g(x, v) = ^2 di ^ x ' y )= e ( x ' y) 1 y) 


(40) 


i= 1 


where, b denotes the number of basis functions; 0(x,y) = (#i, 9b, ..., 0i,) T is the parameter 
vector and *k(x,y) = (^i, ip 2 ,..., 'ipb) T is the basis function vector. So, with regularization 
function R{9) = 6 T 0 and A as the regularization parameter, 

lsgfd($) = e T \ R e - 2h T 0 + \e T e (41) 


where, V fi(6x6) 


*k(x, y)\ k T (x, y)dxdy 


hfoxi — 


'&( x ,y)(yPxy{x,y) - V(p x (x)p y (y)))dxdy 


hi = V R {i>{xuyi),a) 


The optimal value of parameter vector 9{x, y) can be obtained by minimizing the gradient 
of LSGFD and obtained as 


e* = (Vr + AIfe) _1 h (42) 

where, I& is a b-dimensional identity matrix. 


11.1 & 2 FD Estimation through Multiplicative Kernel Model 

Let us use multiplicative Gaussian kernel function as a basis function placed at the selected 
sample points. So, 

b 

g{x, y) = 9iK ( x > x i) L (y> yi) = ° J I k ( x ) ° %)] ( 43 ) 

i=l 
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The V r(x, y) is already derived for LSFD estimator. The sample estimate of h (h) can be 
obtained as under: 


h 


(6xl) 


(k(x) o \(y))(Vp X y(x,y) - V (p x (x)p y (y)))dxdy 
hi = h[ — h'(, where 

K= f [ ( K {x,xi)L(y,yi)) 


Na 2 


N 

£ 

i= 1 




a 


K(x,Xi 


y - Vi 
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L(y, yi ) dxdy 


7T 


d/2 


N 
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Xi - Xl 


V R {xi,xi) 


yi - yi 
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( 2a) d (Na d ) f— j' \ a 

- 77 , r- \ 2 d , r- x i - - Vi ) ex P 

(2n) d / 2 (V2a) 2d (y/2a) d N tt 


VR(yi,yi ) 


(x» - x z ) 2 + (yt - yi ) 2 

In-’ 


h t = 11 ( K(x,x t )L(y,y l )) 

TT d / 2 


1 


N 2 a 2 


N N 

££ 

j=i *=i 


a; — Xi 


a 


(2cr) d (Na) d \ cr 


N 

£ 


Xj - Xl 


N 


Vr{Xi,Xi) £ 




K(x, Xi 


yj - yi 

(7 


y-yj 


a 


L(y,Vj) dxdy 


VR.{yj,yi ) 


(44) 


Thus, the parameter vector (9), the scalar value LSGFD = ||A ||2 and the the A (x,y) - all 
are obtained in terms of the reference and cross reference IIMs. This justifies further the 
purpose to define the quantities RIP, CRIP and IIMs. 

The interaction matrices for RIP (V R ) and CRIP (V) for LSGFD estimation could be 
same as those used to estimate the LSFD. But, for more precise estimations it is better 
to recalculate them using suitable bandwidth parameter for density derivative estimator, 
which is usually smaller than that used for density estimation. 

11.2 & 2 FD Estimation through Multiplicative Kernel Basis Placed at Paired 
and Un-paired Samples 

Similar to the LSFD2 estimator, LSGFD2 estimator can be derived using multiplicative 
kernel basis at unpaired samples and Kronecker structure to achieve precise computation. 
The approximation g(x, y) is defined as: 


9(x,y) 


b b 

££ 9i j K(x,x i )L(y,y j ) 
i=i *=i 


vec(©) r [(I 6 (8> k(x)) o (1 (y) <g> I 6 )] 


where, © is a b x b parameter matrix, vec(-) is a vectorization function and <g> implies the 
Kronecker product. 
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The estimation of Vr is already discussed. The sample estimate of h (h) can be obtained 
as under: 


1 (6 n xl) 


J j [(Ifo®k(x)) o (l(y) <8> Ife)] (Vpxy(x, y) - V (p x (x)p y (y))dxdy) 


h 


TT 


d/2 




N 

E 


Xi - Xl 


(2a) d (Na d ) \ a 

7 T d / 2 


V R (xi,xi ) 


Vi-Vl 


a 


VRiVuVl) 


(2cr) d (Na) d 


f N 

E 

vi=l 


Xi - Xl 
a 


V R (xi,xi) J j ^2 

3=1 


N , , 

Vj - Vi 


a 


VR(Vj>l/i) 


To estimate the optimal parameter matrix 0 is needed. The Equation (34) can be 
written as: 

V^vec(0) + Avec(0) = h 

This is the famous discrete Sylvester equation and requires 0(6 3 ) computations to solve it. 
Now, the Equation (33) can be given as under: 

LSGFD2 = trace(0 r V iJ (x)0V /? (y) T ) - 2h T vec(0) 


12. Empirical Verification of the Derived Estimators as Independence 
Measures 

The derived four estimators - LSFD, LSFD2, LSGFD and LSGFD2 - need empirical ver¬ 
ification. A simple test experiment is designed that verifies their ability to separate the 
independent and dependent signals. Further testing, as a BSS contrast, has been left for 
the future sections. The estimators need bandwidth parameter (h) selection for multivari¬ 
ate kernel density estimation (KDE) and the regularization parameter A. Conventionally, 
the least squares based direct estimation methods use a Cross Validation (CV) method to 
select both the parameters. The CV method is computationally demanding if good number 
of choices for a free parameter are provided to obtain accuracy in estimation. Instead, the 


Silverman’s rule-of-thumb (ROT) (Silverman, 1986), balancing computation and optimal 
parameter value, is used for selecting (h). The Experiment uses ROT for A = 0.01 for the 
test experiments. 

Experiment (Independence Test): Let there be generated two uniformly distributed 
independent signals V(l,:) and X(2,:) with 300 samples each. Let there be generated a 
dependent signal: Y( 1,:) = sin(V(l, :)/20 * n). Find the estimated values for the indepen¬ 
dent signals - V(l,:) and X(2 ,:) - and dependent signals - V(l,:) and V(l,:). 

The results are tabulated in the following Table [lj Each entry in the Table is a mean of 100 
trials. The results show that all the estimators are able to give estimator value sufficiently 
low for independent signals than dependent signals. 


12.1 Empirical Verification of the Derived Estimators as BSS Contrasts 

By definition, statistical independence implies uncorrelatedness (the opposite is true only 
for Gaussian variable). The uncorrelated components with zero mean imply orthogonality. 
So, the ICs with zero mean are also mutually orthogonal. A rotation matrix for specific 
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Table 1: Performances of the derived independence measures with their estimation tech¬ 
niques: on the test set with independence and dependence signals; number of 
samples 300; kernel bandwidth parameter h using ROT; regularization parameter 
A = 0.01. The table entries indicate mean of 100 trials. 


Test Condition 

LSFD 

LSFD2 

LSGFD 

LSGFD2 

independent signals 

0.4725e-03 

0.5057e-03 

0.2915e-03 

0.1165e-03 

dependent signals 

0.0180 

0.0411 

0.0055 

0.0091 


angle, from the infinite set of all n-dimensional rotation matrices, will be able to transform 
a set of zero mean whiten (orthogonal and univariant) components to ICs. So, the BSS 
problem reduces to estimating a rotation matrix R giving m.i.p. y,;(t)s. 


Y*(t) = argmax 4>(y(t)) 

w 


(45) 


where, <h(y(t)) or 4>(Y(i)) is the contrast function based on the dependence or independence 
measure of random vector y. 

It is known that the conventional information theoretic BSS contrasts give local minima 
for multimodal distributions. So, here an experiment designed to test the local minima of 
the derived contrast and some of the conventional contrasts for comparision. 

Experiment (Local Minima Analysis for BSS Contrasts): The experiment is de¬ 
signed to test the existence of spurious local minima in the optimization landscape of the 
derived contrast for BSS of two i.i.d. sources with varying distributions. The number of 
samples (N) were kept 300. 

The contrasts tested include the derived LSFD (with ROT for bandwidth parameter se¬ 
lection), LSFD (with ExROT for bandwidth parameter selection), LSFD2 (with ROT for 
bandwidth parameter selection), LSFD2 (with ExROT for bandwidth parameter selection), 
LSGFD (with ROT for bandwidth parameter selection), LSGFD (with ExROT for band¬ 
width parameter selection), LSGFD2 (with ROT for bandwidth parameter selection), LS- 
GFD2 (with ExROT for bandwidth parameter selection) and existing least squares based 
independence measures LSMI (with Cross-Validation (CV) for bandwidth parameter selec- 
tion) flSugiyama' 2013), LSMI2 (with CV for bandwidth parameter selection) ( | Sakai and 
Sugiyama, 2014[ ) for comparision. There are defined 21 types of distributions and used first 
20 (type a to t) for this experiment. The first 18 types (a to r) of distributions are suggested 
by Bach and Jordan (2003) and two more skewed types of distributions were added. The 


s type is a GGD with skewness s = —0.25 (left skewed) and kurtosis k = 3.75 and the 
t type is a GGD with skewness s = 0.75 (right skewed) and kurtosis k = 0. Both the 
distributions are generated using Power Method with parameters b = 0.75031534111078, 
c = -0.02734119591845, d = 0.07699282409939 for s type and b = 1.11251460048528, 
c = 0.17363001955694 and d = —.05033444870926 for t type. The u type is a Gaussian 
distribution that is added for some other experiment not reported here. All 21 distributions 
are shown in the Figure [l| 


34 
































Figure 1: Probability density functions of sources with their kurtosis: (a) Student with 
3 degrees of freedom; (b) double exponential; (c) uniform; (d) Student with 
5 degrees of freedom; (e) exponential; (f) mixture of two double exponentials; 
(g)-(h)-(i) symmetric mixtures of two Gaussians: multimodal, transitional and 
unimodal; (j)-(k)-(l) nonsymmetric mixtures of two Gaussians, multimodal, tran¬ 
sitional and unimodal; (m)-(n)-(o) symmetric mixtures of four Gaussians: mul¬ 
timodal, transitional and unimodal; (p)-(q)-(r) nonsymmetric mixtures of four 
Gaussians: multimodal, transitional and unimodal; (s) left skewed Generalized 
Gaussian Distribution(GGD); (t) right skewed GGD; (u) Gaussian distribution 
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12.1.1 Parameter Selection in the Derived Estimators for BSS 


The Experiment justified the use of ROT for bandwidth selection, instead CV for the same. 
But, both the methods have one more problem for BSS like signal processing and machine 
learning applications. Compare to the applications in previous experiment, where the com- 
parision was at an event or at a point, those applications require to find the most optimal 
from a given solution set. If CV method is used, there needs to be found new parameter 
value at every point in consideration. That will be computationally too demanding. The 
ROT assumes Gaussian distribution for the unknown PDF. The feasible solution set for the 
problem is expected to have varying properties, like, varying distances from Gaussianity 
and others. Ideally, same bandwidth parameter is not best for all points. For example; in 
case of the BSS application, the goal is to find the most non-gaussian (independent) com¬ 
ponents. For this goal, assuming Gaussianity for the whole solution set is contradictory and 
sure way to bring estimation errors. This brings the need to use data dependent rules for 
kernel smoothing parameter that takes into consideration the variation in the distributions 
of solutions and is also computationally efficient. Such a rule, identified as Extended ROT 
(ExROT) is derived by Dharmani Bhaveshkumar (2015) based on an assumption that the 
density being estimated is near Gaussian and can be approximated using Gram-Charlier A 
Series. The rule is used for the contrast estimation in BSS in the following experiment. 


12.1.2 Results 

The results of the Experiment for local minima analysis are shown in terms of the plots 
of negative of the contrast value versus the rotation angle theta. The minima of the plots 
corresponds to the actual sources. Ideally, it should be at 9 = 0 or tt/2. The plots show the 
9 values in radian multiplied by 100. 

The comparative study shows that all the derived estimators show minima at the 6 = 0 
and no spurious local minima for unimodal distributions. They have better optimization 
landscape than the LSMI and LSMI2 contrasts. The LSGFD2 estimator with bandwidth 
parameter selection has the best performance for multimodal distributions compare to all 
other contrasts, though it has local minima for distributions Type 4, Type 15 and few 
others. 


13. Conclusion 

The article proves that the Gradient Function Difference (GFD) being zero everywhere 
imply independence. For a bounded support random vector the Hessian Function Dif¬ 
ference (HFD) being zero everywhere imply independence. Accordingly, L p measure of 
FD, GFD and HFD are proved to be independence measures. They are used to derive 
contrast functions for simultaneous ICA and BSS. The contrast functions are proved to 
satisfy the properties of Scale Invariance, Dominance and Discrimination, avoiding spurious 
global maxima. There has also been derived least squares based two methods to estimate 
L 2 of FD and L 2 of GFD contrasts using multiplicative kernel basis. In the first method 
the basis are placed at only joint samples and in the another method basis are placed at 
both paired and unpaired samples. The first method requires computations of the order of 
0(b 2 + N(b+n — 1)) and the second method requires that of the order of 0(6 3 + (N + n)b n )). 
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Figure 2: Plots of LSFD Contrast estimated with bandwidth parameter through ROT ver¬ 
sus theta value for the first 20 distributions a-s stacked rowwise 



Figure 3: Plots of LSFD Contrast estimated with bandwidth parameter through ExROT 
versus theta value for the first 20 distributions a-s stacked rowwise 
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Figure 4: Plots of LSFD2 Contrast estimated with bandwidth parameter through ROT 
versus theta value for the first 20 distributions a-s stacked rowwise 



Figure 5: Plots of LSFD2 Contrast estimated with bandwidth parameter through ExROT 
versus theta value for the first 20 distributions a-s stacked rowwise 
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Figure 6: Plots of LSGFD Contrast estimated with bandwidth parameter through ROT 
versus theta value for the first 20 distributions a-s stacked rowwise 



Figure 7: Plots of LSGFD Contrast estimated with bandwidth parameter through ExROT 


versus theta value for the first 20 distributions a-s stacked rowwise 
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Figure 8: Plots of LSGFD2 Contrast estimated with bandwidth parameter through ROT 
versus theta value for the first 20 distributions a-s stacked rowwise 



Figure 9: Plots of LSGFD2 Contrast estimated with bandwidth parameter through ExROT 
versus theta value for the first 20 distributions a-s stacked rowwise 
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Figure 10: Plots of LSMI Contrast estimated with bandwidth parameter through CV versus 
theta value for the first 20 distributions a-s stacked rowwise 



Figure 11: Plots of LSMI2 Contrast estimated with bandwidth parameter through CV ver¬ 
sus theta value for the first 20 distributions a-s stacked rowwise 
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But, the second method requires less samples for the same performance. The empirical ver¬ 
ification justifies the derived contrasts for BSS applications. But, further experiments are 
needed to have the comparision with other contrasts on separation quality against varying 
number of sources and varying number of samples. The required performance analysis is 
restricted here and targeted in the future versions of this article. 

Appendix A. Measure, Metric, Norm and L p -norm 

Measures necessarily assign some nonnegative number to the members of a set in some 
systematic way. The distance measures or distance functions assign nonnegative value for 
two elements of a set. Let there be set S. Then, a distance function d : S x S —>• M may 
satisfy the following conditions for x,y, z E S: 

1. d(x,y) > 0 (non-negativity) 

2. d(x, y) = 0 iff x = y (identity of indiscemibles) 

3. d(x, y ) = d(y, x ) (symmetry) 

4. d(x, z) < d(x, y) + d(y , z) (subadditivity or triangle inequality) 

The distance functions satisfying first two conditions are called divergence measures and 
those satisfying all four conditions are called metric. For example, if S contains n-dimensional 
vectors then Vx, y £ S, p > 1, d p : S x S —>• M defined as under is a metric. 

d p (x,y) = | \ x i ~ Vi\ P 
\i= 1 

The concept to derive above metric is inspired by the distances in Euclidean geome¬ 
try. The generalization of this distance measure on sets to that on vector spaces is obtained 
by defining a norm. Given a vector space V over a field F, a norm is a function p : V —> M 
with the above four properties of metric and added property of absolute Scale Invariance 
defined as under: 

p(ax) = |a|p(x),Vx E V, a 6 F 

For example, given an n-dimensional vector space W 1 and x E R n ; the L p -norm of x for a 
real number p > 1, is defined as: 

INI p = (ki| p + \x 2 \ p + • •. + \x n \ p )p 

The same definition has been also extended for functions in L p -spaces. A point in 
L p - space is an L p integrable function. A function / : M n —>• M is L p integrable, if p-tli power 
of its absolute value is finite, or equivalently, 

|/( x )llp = \f(^)\ P d^j P < oo 

It is a complete normed space with all the L p integrable functions, their linear combinations 
through real coefficients and including all limit points. 



42 



Appendix B. Information Potential (IP) and related Concepts 

In a general sense, potential means an unrealized ability. The gravitational potential and 
the electric potential are the known examples from Physics. In both the examples, potential 
created by a particle (with mass or charge) is inversely proportional to the distance. In kernel 
density estimation, a kernel is placed at each sample location and usually kernel is a positive 
definite function decaying with distance. This fact brings analogy with the potential theory. 
Each sample is an information particle. The PDF is the information potential field in which 
the information particles interact with each other. In a scalar field, the total potential is 
the summation of potential due to individual particles. The information potential (IP) due 
to the system of samples or the field is given in a same way. For a random variable x, the 
potential on a sample Xj due to other samples, assuming Gaussian kernel, is given by 

1 N 

V 2 (xj ) = — ^2 V 2( x j-,Xi) where, V 2 (xj,Xi) = G ay ^(xj - Xi) 
i— 1 

So, the IP of x is 

1 N N N r 

V 2 (x) d = 52 = jp 5252 V2 (**> = / {/( x ) 2 } 

j=i j =i i=i J 

The quantity V 2 (x) or IP is same as the integration of the square of the PDF. Instead of 
usual sum in potential theory, the normalization is done to get integral over PDF to be 1. 
The subscript of V reminds us that this is the quadratic information potential (QIP) as 
square of the PDF is integrated. The definition is generalized for any a by defining V a as the 
integral of a power of the density. Also, instead of a Gaussian kernel any other kernel can 
be selected. But, they may not have as smooth characteristic as for a = 2 with Gaussian 
kernel. Using this result, ITL theory has defined several scalar descriptors of PDF, that 
just depend upon the available samples with whole PDF structure into consideration. 

The defined in the article, is already defined as QMIed by (Principe, 2010). 

The quantity QMIed > for a random vector x = (x\,x 2 ), in terms of IP is derived as under: 


(Principe, 2010 


QMIed(xi,X 2 ) — Ded(JPx\X2 ("D > x 2 )i Px\ ( x l)Px2 (x 2 )) 

(p x H2 (xi,x 2 ) - p Xl (xi)p X2 ( x 2 )) 2 dxidx 2 



X2 J X\ 



(p xi x 2 (xi,x 2 )) 2 dxidx 2 + 


X2 J X\ 



(~Px! (xi)p X2 {x 2 )) 2 dx 1 dx 2 


X 2 JX\ 



2Px 1 x 2 (xi,x 2 )p Xl (x 1 )p X2 (x 2 )dx 1 dx 2 


' X2 J X\ 

= Vj + Vm ~ 2 Vc 


where, Vj is the IP of the joint PDF, Vm is the potential of the product of the marginal 
PDFs and Vc is the Cross Information Potential (CIP) similar to the concepts of cross 
entropy or cross correlation. 
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The potentials can be estimated through kernel methods. 


N N 

Vj = ^EE^ X W’ X ^)) 

j =1 4=1 
N N 

= jfi EE G ^( x W’ x W) 

3 =1 *=1 
JV AT 

= ^2 EE - ^iCO^v/^OO - ® 2 (j)) 

j=i *=1 
N N 

= ^£E F 2 0*4 (i),xi{j)) V 2 (x 2 (i),x 2 (j)) 
j =1 *=1 


V^M 


( N N \ / N N 

Jp EE G <tV*( X l(*) - *l0')) h^2 E E G ^(*2(») - X 2 {j)) 

3 = 1 i=l / \ i=l 4=1 

^ 2 (xi)V2(x 2 ) 


V C = 


Px lX2 (xi, x 2 )p Xl (xi)p X2 ( x 2 )dxidx 2 


X 2 JX\ 


N 


N 


E G a (x 1 - xi(k))G a (x 2 - x 2 (k)) 


k= 1 


1 * 


i=i 


Af 


AT 


yG a (x 2 ~ x 2 {j)) 


dx\dx 2 


3 =1 

N N N 

^E^E^E / G<J ( Xl “ x iW) G ct(®i - xi(fc))da;i 

4=1 JV j=l 1 k= 1 J 
f 

G cr (x 2 - x 2 (j))G a (x 2 - x 2 (k))dx 2 


at r. at 


JfE 4E G <T^(*i( fc )-*i(*)) 


AT ^ iV 
fc=i L i=i 

N 


N 


N 


y G aV2(M k ) “*2(j)) 


J=1 


l^y 2 (x 1 (/ c ))i/ 2 (.x 2 (fc)) 


fc=i 


B.l Information Forces (IF) 

It is obvious to think of information forces, once defined the IP. Potential and the force are 
related concepts. One of the interpretation of potential is the amount of work done required 
to bring a unit charge or unit mass from infinity to the point in the force field. The particle 
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contains amount of potential energy that has been applied to work against the force. The 
force on sample Xj is the derivative of the IP at a sample with respect to the position of 
sample xj, that is: 


n(*j) = 




N 


X-i — Xi = 


i =1 


N 


Y F 2 (xj-xi) 


i= 1 


^2 ( X * - X j) G *V2( X J ~ X i) 
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